This manual describes the standardized T1-weighted image quality control (QC) protocol used in the TIGR Lab. This QC protocol was used for the publication of “name of paper”. The procedure consists of two steps, a qualitative (visual) and a quantitative QC, as described below. Failing either the visual or the quantitative QC should result in exclusion.
QC decisions should be finalized by the researcher using the data for their respective paper; however, if more than one person is using the same dataset, it is recommended that they use the same QC ratings to improve within-lab replicability. In this case, two lab members should perform QC on the same dataset to evaluate inter-rater reliability. If this is not possible due to a very large sample size, then it is recommended that two people QC a subset of the same dataset to ensure that their ratings are similar.
The fMRIPrep output HTML files will be used for visual QC. Given the sample size of many of the studies in the lab, using the fMRIPrep output provides an efficient and accurate way to make QC decisions. This step requires a CSV file to note artefacts, record comments, and detail final QC decisions. Here is a link to the template as well as a visual example of the document:
The fMRIPrep pipeline outputs three major visual sections that are used for QC, which makes it possible to scroll through one HTML page per participant: the brain mask, the T1-to-MNI registration, and the brain segmentation (surfaces).
The overall rating system is a 5-point scale, where 1 indicates the least problematic artefacts and 5 the most problematic. The chart below details example criteria for each rating.
The brain mask image indicates whether the pipeline correctly differentiated between brain tissue and the skull. The primary objective is to check whether the brain extraction was accurate. The mask should follow the outline of the brain: it should not exclude any brain tissue, nor should it include non-brain tissue such as the dura mater or the skull. Normally, an incorrect brain mask is the result of a poor-quality scan; however, if the quality of the scan is good and the brain mask is not accurate, the data for that participant should be re-processed. The brain mask section does not need a specific rating, but comments, if necessary, should be noted.
In this example, the difference between the pass and the fail is very clear. There will be a few edge cases where the mask misses a bit of brain tissue or captures non-brain regions. These cases should be noted and, if the participant’s other scans seem fine (i.e., they don’t have artefacts), they should be re-processed and re-assessed.
The next set of brain images depicts each participant’s brain slices overlaid on MNI space (a template averaged from the brains of 20+ healthy adults). This overlay helps compare the quality of each participant’s scan against an averaged good-quality brain. Poor-quality images are often easily identifiable. Common artefacts include ringing, ghosting, or truncated brain regions (see below).
Deciphering good from bad quality scans often isn’t challenging, but most clinical datasets (especially pediatric samples) will have many edge cases, that is, brain images that aren’t of good quality but aren’t too bad either. It is for this reason that the artefact rating system includes five ratings. There may be many cases where you are unsure which rating is appropriate. A general pattern: if there are ringing artefacts across large regions of the brain (for example, the entire cortex has visible rings), or if there is intensity non-uniformity, the scan should be scored as ‘moderate’. Examples of edge cases are included below.
This section is arguably the most important when it comes to making QC decisions. Poor placement of the pial and white matter surfaces is often indicative of a poor scan. Segmentation is assessed based on whether the two surfaces properly differentiate the gray and white matter. If the brain scan is of poor quality and contains artefacts, the segmentation will likely be incorrect, and that participant should be excluded. Importantly, if there are minor artefacts but the segmentation is good, that participant can be included in the analysis (they would likely receive a rating of 3).
If the segmentation is not accurate but the scan is otherwise good, it is recommended that you re-process that participant. If the segmentation is still not accurate, it is recommended that you use the ‘control points’ feature provided by FreeSurfer to fix this (a link on how to do this: ). Importantly, if there are minor-to-moderate artefacts (such as ringing) but the segmentation is otherwise accurate, the participant should not be excluded and should instead be given a rating of 3 (moderate). Once all scans are rated, participants should be grouped into pass, fail, and doubtful based on the score they received.
Ultimately, visual QC decisions are subjective in nature, and it is challenging to achieve good inter- and intra-rater reliability. The effects of poor-quality images are increasingly well documented in brain imaging research. To account for this, we recommend running all analyses in the main cohort (which includes the pass and doubtful groups) and an additional sensitivity analysis in the ‘pass’ group only, which contains only good-quality scans and removes the edge cases. Running all analyses in the ‘pass’ group (called the stringent cohort) helps ensure that any potential issues with image quality are accounted for. One caveat of this sensitivity analysis is that it doubles the number of analyses, so you will need to correct for twice as many comparisons.
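As a rough sketch of how the grouping and cohort split could be scripted, the snippet below assumes a ratings CSV with “participant_id” and “rating” (1-5) columns, and assumes that ratings 1-2 map to pass, 3 to doubtful, and 4-5 to fail; the file name and cutoffs are illustrative, so match them to your own rating chart before use.

    # Rough sketch: split participants into QC groups and cohorts.
    # File name and rating-to-group cutoffs are assumptions.
    import pandas as pd

    ratings = pd.read_csv("visual_qc_ratings.csv")  # hypothetical file name

    def to_group(rating: int) -> str:
        """Map a 1-5 artefact rating to pass / doubtful / fail (assumed cutoffs)."""
        if rating <= 2:
            return "pass"
        if rating == 3:
            return "doubtful"
        return "fail"

    ratings["group"] = ratings["rating"].apply(to_group)

    # Main cohort = pass + doubtful; stringent cohort = pass only.
    main_cohort = ratings[ratings["group"].isin(["pass", "doubtful"])]
    stringent_cohort = ratings[ratings["group"] == "pass"]
    print(f"main n={len(main_cohort)}, stringent n={len(stringent_cohort)}")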
Relying solely on visual QC can introduce subjective biases and rater effects such as decision fatigue. It also increases the possibility of missing important issues in a scan. One way to increase objectivity in the QC procedure is to introduce quantitative image quality metrics that can be analyzed.
The MRIQC pipeline provides three metrics that give information about the quality of the brain images (illustrative formulas are sketched after the list):
1. Contrast-to-noise ratio (CNR): high CNR indicates good contrast between different brain tissues.
2. Signal-to-noise ratio (SNR): the signal in a voxel divided by the standard deviation of the signal across a region outside the brain. High SNR means a strong signal with low noise.
3. Coefficient of joint variation (CJV): higher values are linked to more head motion and intensity non-uniformity artefacts.
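For reference, one common set of definitions is sketched below; these are illustrative and may differ in detail from MRIQC’s exact implementation (MRIQC, for instance, estimates noise from specific tissue and air masks). Here, μ and σ denote the mean and standard deviation of voxel intensities in gray matter (GM), white matter (WM), or the background (air) region:

    \mathrm{SNR} = \frac{\mu_{\mathrm{tissue}}}{\sigma_{\mathrm{air}}}
    \qquad
    \mathrm{CNR} = \frac{|\mu_{\mathrm{GM}} - \mu_{\mathrm{WM}}|}{\sigma_{\mathrm{air}}}
    \qquad
    \mathrm{CJV} = \frac{\sigma_{\mathrm{WM}} + \sigma_{\mathrm{GM}}}{|\mu_{\mathrm{GM}} - \mu_{\mathrm{WM}}|}

Note the directions: higher SNR and CNR are better, while higher CJV is worse.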
These three metrics are provided by the MRIQC pipeline in a CSV file. It is recommended that each metric is plotted as a distribution (e.g., a histogram). Depending on the distribution of these values, it is recommended that participants whose values exceed 2 standard deviations from the mean be excluded. If this excludes a high percentage of the sample (>10%), then exclude only those who exceed 2.5 or 3 standard deviations. Plotting the distribution will be helpful in this case.
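A minimal sketch of this exclusion rule is below. The file name and metric column names (“cnr”, “snr_total”, “cjv”) are assumptions; check them against your MRIQC group-level output, since MRIQC reports several SNR variants (e.g., per-tissue values).

    # Rough sketch of the standard-deviation exclusion rule on MRIQC metrics.
    # File name and column names are assumptions.
    import pandas as pd

    iqms = pd.read_csv("mriqc_group_T1w.csv")  # hypothetical path
    metrics = ["cnr", "snr_total", "cjv"]      # assumed column names
    threshold = 2.0  # relax to 2.5 or 3.0 if >10% of the sample is flagged

    flagged = pd.Series(False, index=iqms.index)
    for m in metrics:
        z = (iqms[m] - iqms[m].mean()) / iqms[m].std()
        # Two-sided rule; a one-sided rule (low SNR/CNR, high CJV)
        # is a reasonable alternative, since only one tail is "bad".
        flagged |= z.abs() > threshold

    print(f"flagged {int(flagged.sum())} of {len(iqms)} participants "
          f"({100 * flagged.mean():.1f}%)")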
The last step recommended for QC is to have someone else QC the same dataset and/or to perform an automated QC to compare against the initial results. The optimal solution is to have both a second rater and the automated QC (we recommend MRIQC’s random forest classifier, since 3 of its 14 image quality metrics were already examined in the previous step). The automated classifier provided by MRIQC uses the 14 image quality metrics and provides a binary prediction of whether each participant should be included or excluded (link: https://mriqc.readthedocs.io/en/stable/cv/base.html). These are not the results you should use for your analysis, but they provide a more objective check of the exclusions in your dataset. If many of the visual scores are incongruent with the automated scores, then re-checking the visual ratings of those participants is recommended. Whether a second rater or the automated classifier is used, an inter-rater reliability test should be performed.
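Any standard agreement statistic works for this check; as one option, the sketch below computes Cohen’s kappa between two columns of binary include/exclude decisions (a second rater, or the classifier’s predictions) using scikit-learn. The file and column names are hypothetical.

    # Minimal agreement check between two sets of binary QC decisions.
    import pandas as pd
    from sklearn.metrics import cohen_kappa_score

    decisions = pd.read_csv("qc_decisions.csv")  # hypothetical merged file
    kappa = cohen_kappa_score(decisions["rater1_include"],
                              decisions["rater2_include"])
    print(f"Cohen's kappa = {kappa:.2f}")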
Here are a few example images from the POND dataset; they include two examples each of good, doubtful, and excluded participants. It is important to remember that differing opinions between raters are inevitable; the goal is to identify the least noisy data to analyze.
These two examples are clearly good-quality images:
Decisions based on these images can be contentious; some raters may want to include them in the analyses, while others may think they should be excluded. The trade-off to consider is that we don’t want to drastically reduce the variability in the data, nor do we want to introduce excess noise. By including these images in the “main” cohort but not the “stringent” cohort, we hope to strike a balance in this trade-off.
A few things to note when looking through this example:
This second example is of worse quality than the first, and some raters may prefer to exclude this image. A few things to note:
This kind of example doesn’t occur often: a straightforward case of a bad-quality scan due to severe head motion in the scanner. The brain mask fails to capture several brain regions, there are heavy rings and truncated brain regions in the T1-to-MNI registration, and the segmentation is very poor. These are the least controversial images to exclude.
This example is representative of a typical “poor quality scan”:
Neither qualitative nor quantitative QC will ever be perfect, and it’s common to have disagreements about which scans should be included or excluded. One strategy is to practice QC on 30-50 participants from an independent sample and compare those ratings with those of the person who originally QCed that sample. This way, we can hope to move closer to a standardized procedure.